---
permalink: /textanalysis/
keywords: fastai
description: "Awesome summary"
title: Text analysis
toc: false
branch: master
badges: true
comments: true
categories: [text analysis, sentiment analysis, wordclouds]
image: images/some_folder/your_image.png
hide: true
search_exclude: false
metadata_key1: metadata_value1
metadata_key2: metadata_value2
nb_path: notebooks_final\05_Text_Analysis.ipynb
layout: notebook
---
This section on text analysis is divided into two parts: wordclouds and sentiment analysis. Both the extracted wiki pages and the character dialogue will be used, and we will investigate how the wordclouds and the sentiment analysis differ between the two data sets.
First, we will take a look at wordclouds. As mentioned before, both the extracted wiki pages and the full series dialogue will be investigated. We start by generating wordclouds for characters of interest: Jon Snow, Arya Stark, Bronn, Brienne of Tarth and Jaime Lannister. The first step in generating the wordclouds is to compute the term frequency-inverse document frequency (TF-IDF) for each text corpus, i.e. the wiki pages and the episode dialogue. For further explanation of TF-IDF and its computation we refer to the Explainer Notebook. Note that we have removed all characters' names from the text corpora, as these would not be very descriptive of a character in a wordcloud or during sentiment analysis.
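As a rough illustration of the TF-IDF step described above, here is a minimal sketch in plain Python. It assumes one common weighting (term count divided by document length, times the logarithm of the number of documents over the document frequency); the Explainer Notebook may use a different variant. The function name `tf_idf`, the `excluded` parameter and the toy tokens are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents, excluded=frozenset()):
    """Return {doc_name: {term: tf-idf score}} for tokenised documents.

    documents: dict mapping a document name to a list of tokens.
    excluded: tokens to drop first, e.g. character names.
    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    cleaned = {
        name: [t for t in tokens if t not in excluded]
        for name, tokens in documents.items()
    }
    n_docs = len(cleaned)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in cleaned.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in cleaned.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Toy corpus with made-up tokens, for illustration only
corpus = {
    "jon": ["jon", "wall", "watch", "sword", "wall"],
    "arya": ["needle", "list", "sword"],
}
scores = tf_idf(corpus, excluded={"jon", "arya"})
top_jon = max(scores["jon"], key=scores["jon"].get)
```

A term that occurs in every document, like `sword` in the toy corpus, gets a score of zero, which is why TF-IDF favours words that are specific to one character.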
Now, let's take a look at the generated wordclouds for the selected characters.
When comparing the wordclouds generated from the two data sets, it should be noted that, for the most part, the same words do not appear for a given character in both. This is expected: the text from a character's wiki page is more descriptive of the character and their place in the story, whereas the wordcloud from the dialogue shows exactly that, namely the words the character uses throughout the series that are most descriptive according to TF-IDF. It would be interesting to compare this with the sentiment analysis, which is the second part of this page.
Next, we will generate wordclouds based on the characters' allegiance. This is done by pooling the dialogue text of characters belonging to the same allegiance and, again, computing the respective TF-IDF scores in order to generate the wordclouds. For this, we have selected the houses Stark, Lannister, Targaryen and Greyjoy, and the independent group The Night's Watch. It would be interesting to see whether the houses' mottos appear in these wordclouds. The respective house mottos are: "Winter Is Coming" (Stark), "Hear Me Roar!" (Lannister), "Fire and Blood" (Targaryen) and "We Do Not Sow" (Greyjoy).
As the Night's Watch is not a House but rather a brotherhood sworn to protect The Wall, they do not have a motto.
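The pooling of dialogue by allegiance could look roughly like the sketch below. The function name `pool_by_allegiance`, the shape of the two dictionaries and the toy tokens are assumptions made for illustration, not the notebook's actual data structures.

```python
from collections import defaultdict

def pool_by_allegiance(dialogue_tokens, allegiance_of):
    """Merge per-character token lists into one token list per allegiance."""
    pooled = defaultdict(list)
    for character, tokens in dialogue_tokens.items():
        group = allegiance_of.get(character)
        if group is not None:  # skip characters without a known allegiance
            pooled[group].extend(tokens)
    return dict(pooled)

# Made-up toy data, for illustration only
dialogue_tokens = {
    "Jon Snow": ["wall", "wildling"],
    "Samwell Tarly": ["book", "wall"],
    "Tywin Lannister": ["debt", "lion"],
    "Unknown Extra": ["hello"],
}
allegiance_of = {
    "Jon Snow": "Night's Watch",
    "Samwell Tarly": "Night's Watch",
    "Tywin Lannister": "Lannister",
}
pooled = pool_by_allegiance(dialogue_tokens, allegiance_of)
```

Each pooled token list can then be treated as a single document when computing TF-IDF, so that the scores describe the allegiance rather than an individual character.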
When comparing the wordclouds above with the respective house mottos, only the Lannisters' Hear (large, in the middle of the wordcloud) is present. All the wordclouds are, however, very descriptive of the respective houses. For instance, for the Night's Watch, a military order sworn to protect The Wall, words like protect, wildling and swear are present. The same can be said for House Targaryen, whose main character, Daenerys, is married to a Dothraki warlord and later in the show leads the Dothraki people herself.
We will now generate wordclouds based on the season sections of the wiki pages. It would be interesting to see how these wordclouds change as the story unfolds, and to investigate whether the overall theme of the series changes over its course and whether this can be seen in the wordclouds.
Taking the wordclouds generated for seasons 1 and 8 as examples, the emphasized words seem very descriptive of their respective seasons. Starting with season 1:
Comparing the wordclouds of season 1 and season 8, it appears that season 8 has different keywords. For season 8:
It should also be noted that the word destroy is present in the majority of the wordclouds, being omitted only in those for seasons 1 and 3.
In this second part of the text analysis, we perform a sentiment analysis of the characters, again based on both their wiki pages and their dialogue in the series. As we saw for the selected characters, the wordclouds based on the wiki pages differed considerably from those based on the dialogue. It would be interesting to see whether this also results in different sentiment levels for the characters. Additionally, we perform a sentiment analysis of the individual seasons of the series, to determine whether any season differs significantly in sentiment.
For the sentiment analysis, we apply both the dictionary-based method of LabMT and the rule- and dictionary-based method of VADER. For further explanation of how these sentiment scores are computed and the difference between the two methods, we again refer to the Explainer Notebook. Note that the two methods use different scales: LabMT scores sentiment on a scale from 1 to 9, while VADER scores on the range [-1, 1]. For LabMT, a score of 5 is considered neutral, while for VADER a score within the range [-0.05, 0.05] is considered neutral.
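To make the two scales concrete, the sketch below shows a LabMT-style score as the mean happiness of the tokens found in a lexicon, and a helper that maps a VADER compound score to a coarse label using the ±0.05 thresholds. The lexicon values are made up for illustration and are not real LabMT entries; the function names are hypothetical, and the VADER compound score itself is assumed to come from a VADER implementation rather than being computed here.

```python
def labmt_score(tokens, happiness):
    """Mean happiness (scale 1 to 9) of the tokens found in the lexicon.

    Returns None when none of the tokens is in the lexicon.
    """
    found = [happiness[t] for t in tokens if t in happiness]
    return sum(found) / len(found) if found else None

def vader_label(compound):
    """Map a VADER compound score in [-1, 1] to a coarse label.

    Follows the usual VADER convention: scores strictly between
    -0.05 and 0.05 are treated as neutral.
    """
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Made-up lexicon entries for illustration only, not real LabMT values
toy_happiness = {"love": 8.0, "war": 2.0, "wall": 5.0}
score = labmt_score(["love", "war", "wall", "dragon"], toy_happiness)
labels = [vader_label(c) for c in (-0.6, 0.0, 0.3)]
```

Tokens missing from the lexicon, such as `dragon` in the toy example, are simply ignored, so the LabMT-style score only reflects the words the lexicon knows.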
This subsection investigates the sentiment of each character based on their wiki page. We then compare this with the sentiment of the characters based on their dialogue.
From the figure below it can be seen that the two methods, again, do not completely agree, although they yield approximately the same result. Again, the figure displays the 10 happiest and 10 saddest characters according to LabMT and VADER.
At first glance, the VADER scores of the happiest characters are lower than in the dialogue-based analysis, whereas the saddest characters achieve almost the same scores. The LabMT results are quite similar in sentiment level. Again, many characters appear in both results, such as Septa, Moro, Orell and Polliver.
When comparing with the results based on the character dialogue, not many characters are found in all four results. This could indicate that the wiki pages and the dialogue do not contain the same information, or that the words chosen on the wiki pages do not necessarily carry information about a character's sentiment.
We would expect the dialogue to contain a greater variety of words that can explain a character's mood, whereas the wiki pages would contain words that describe the character and their actions. We also notice that the variation in VADER sentiment scores is far greater for the dialogue than for the wiki pages, which could indicate that our hypothesis is true.
In this subsection we investigate the sentiment of the characters based on their dialogue, which is taken from transcripts. We use all dialogue across all seasons, as this is expected to give a better overview of each character's sentiment.
The figure below presents the sentiment of the 10 happiest and 10 saddest characters. In the left panel the sentiment is based on LabMT, whereas the right panel is based on VADER.
It should be noted that the two methods do not completely agree, but some characters are present in both results: Daisy, Pyat Pree, Olyvar and Matthos Seaworth are among the 10 happiest characters according to both methods. Some characters, such as Gregor Clegane, are likewise present in both lists of the saddest characters.
Judged against the scales, which only go up to 1 for VADER and 9 for LabMT, the happiest characters appear to be quite happy. The same holds, in the opposite direction, for the saddest characters.
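The ranking of characters and the overlap between the two methods discussed in this subsection can be sketched as follows. The function name `happiest_and_saddest` and all scores and character labels in the example are hypothetical and do not come from the actual analysis.

```python
def happiest_and_saddest(scores, n=10):
    """Return the n highest- and n lowest-scoring (name, score) pairs.

    scores: dict mapping a character name to a sentiment score.
    n is assumed to be a positive integer.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    happiest = ranked[:n]
    saddest = ranked[-n:][::-1]  # lowest score first
    return happiest, saddest

# Made-up scores for illustration only
labmt = {"A": 6.1, "B": 4.8, "C": 5.5, "D": 5.0}
vader = {"A": 0.9, "B": -0.7, "C": -0.2, "D": 0.4}

labmt_top, labmt_bottom = happiest_and_saddest(labmt, n=2)
vader_top, vader_bottom = happiest_and_saddest(vader, n=2)

# Characters that both methods place among the happiest / saddest
happy_in_both = {name for name, _ in labmt_top} & {name for name, _ in vader_top}
sad_in_both = {name for name, _ in labmt_bottom} & {name for name, _ in vader_bottom}
```

Because the two methods use different scales, comparing the sets of names rather than the raw scores is the simplest way to check how far they agree.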
As a last element of our sentiment analysis, we dive into the sentiment of each season. This could help us investigate whether the general mood changes from season to season and, combined with the wordclouds of each season, indicate whether the theme of the series changes as it progresses.
The figure below shows the sentiment of each season based on the LabMT and VADER methods. According to LabMT, all seasons are approximately neutral, whereas the VADER scores lie just on the sad side of the spectrum. Furthermore, season 4 is the saddest and season 6 the "happiest" when comparing the seasons.
In season 4, many of the semi-main characters die, such as Prince Oberyn, Joffrey Baratheon, Shae and Tywin Lannister, and the Mountain (Gregor Clegane) is transformed into the monster version of himself, which could explain why this season is the saddest according to the sentiment analysis.
As a last element of our text analysis, we investigate how some selected words are used across the different seasons, which could again help us understand how the theme of the series evolves. We do this using a lexical dispersion plot.
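A lexical dispersion plot marks, for each selected word, the positions in the token stream where the word occurs. The sketch below only computes those positions and the offsets at which each season starts. The function names and toy tokens are hypothetical, and the actual plot may well have been produced with a library routine such as NLTK's `Text.dispersion_plot`.

```python
def season_boundaries(tokens_per_season):
    """Concatenate season token lists; return all tokens and start offsets."""
    all_tokens, starts = [], []
    for tokens in tokens_per_season:
        starts.append(len(all_tokens))
        all_tokens.extend(tokens)
    return all_tokens, starts

def dispersion_offsets(tokens, targets):
    """Return {target: [token positions]} using case-insensitive matching."""
    offsets = {t.lower(): [] for t in targets}
    for position, token in enumerate(tokens):
        word = token.lower()
        if word in offsets:
            offsets[word].append(position)
    return offsets

# Made-up tokens standing in for three seasons of dialogue
seasons = [
    ["Winter", "is", "coming"],
    ["the", "dragon", "and", "the", "wolf"],
    ["winter", "came"],
]
tokens, starts = season_boundaries(seasons)
offsets = dispersion_offsets(tokens, ["winter", "dragon", "wolf", "wedding"])
```

Plotting each list of positions on its own horizontal line, with vertical lines at the season start offsets, gives the dispersion plot.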
Looking at the lexical dispersion plot above, the first word we chose was winter. This is because the famous Stark house words are "Winter is coming", and we wanted to investigate how much this phrase is actually used. It appears that winter is used most at the beginning and the end of the show. Searching only for the word winter has a caveat, though: other common phrases, such as the long winter, are also counted here.
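One way to quantify this caveat would be to count the full phrase separately from the single word. The sketch below is an illustration under that assumption and was not part of the original analysis; `count_phrase` and the example sentence are hypothetical.

```python
def count_phrase(tokens, phrase):
    """Count case-insensitive occurrences of a phrase in a token list."""
    words = phrase.lower().split()
    if not words:
        return 0
    lowered = [t.lower() for t in tokens]
    n = len(words)
    # Slide a window of n tokens over the text and compare it to the phrase
    return sum(lowered[i:i + n] == words for i in range(len(lowered) - n + 1))

# Made-up example sentence
example = "Winter is coming . The long winter is here".split()
word_count = count_phrase(example, "winter")
motto_count = count_phrase(example, "winter is coming")
```

The difference between the two counts gives the number of mentions of winter that are not part of the Stark words.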
Another interesting comparison is between the words dragon and wolf. The Targaryens and the Starks are referred to as dragons and wolves respectively, but the Stark children also raise their own direwolves throughout the show. The same can be said for Daenerys, whose dragons are born at the end of season 1 and raised throughout the show. At the beginning of the show, wolves are mentioned more often than towards the end, where they are barely mentioned. The opposite holds for dragon, which is mentioned less at the beginning but more and more as the story unfolds.
It can also be seen that the word wedding is mentioned most during seasons 3 and 4. This is consistent with the story, as Robb Stark, Joffrey Baratheon and Sansa Stark are all married during these seasons.
In this section, we have found the most important words at the character, allegiance and season level. The words and their importance were represented with wordclouds, and the chosen words were found to represent the characters, allegiances and seasons well. We also compared the two data sources, wiki pages and show dialogue, and found that their choice of words differs markedly. This makes sense, as the wiki pages would be expected to be more descriptive of the story, its characters and the setting, whereas the dialogue would be expected to use a cruder choice of words.
Using sentiment analysis, the happiest and saddest characters of the show were found, both according to the characters' dialogue and according to their wiki pages. It was also found that the sentiment of the wiki pages differs from the sentiment of the dialogue.
Finally, a lexical dispersion plot was computed in order to visualise how some of the common words of the Game of Thrones world were used throughout the show.